Encode args in a info doc value rather than using placeholders #132782

parkertimmins · 2025-08-13T03:05:56Z

The existing method of determining where to insert arguments in the template is by replacing known placeholder values which were inserted in the template string. For example, the message `found 5 errors` would be separated into the argument `5`, and the template `found %W errors`, where `%W` is the placeholder. There are a few problems with this method. First, we need special handling if the original message contains a placeholder string. We could handle this with some sort of escape, but this adds complexity, and costs time during ingestion. The second issue is that scanning for placeholders within the template string is slow: it is much faster to reconstruct the original message if we already know the location of the arguments in the template string.
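To make the first problem concrete, here is a minimal sketch of a placeholder-based split and reconstruction. It is not the code in this PR: the class `PlaceholderSketch`, the `Parts` record, and the rule "a space-delimited token containing a digit is an argument" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the placeholder approach; not the Elasticsearch implementation.
final class PlaceholderSketch {
    static final String PLACEHOLDER = "%W";

    record Parts(String template, List<String> args) {}

    // Assumption for this sketch: a space-delimited token is an argument if it contains a digit.
    static Parts split(String message) {
        List<String> args = new ArrayList<>();
        StringBuilder template = new StringBuilder();
        String[] tokens = message.split(" ", -1);
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                template.append(' ');
            }
            if (tokens[i].chars().anyMatch(Character::isDigit)) {
                args.add(tokens[i]);
                template.append(PLACEHOLDER);
            } else {
                template.append(tokens[i]);
            }
        }
        return new Parts(template.toString(), args);
    }

    // Rebuild the message by scanning the template for placeholders, left to right.
    static String reconstruct(Parts parts) {
        String template = parts.template();
        StringBuilder out = new StringBuilder();
        int from = 0;
        int argIndex = 0;
        int at;
        while (argIndex < parts.args().size() && (at = template.indexOf(PLACEHOLDER, from)) >= 0) {
            out.append(template, from, at).append(parts.args().get(argIndex++));
            from = at + PLACEHOLDER.length();
        }
        return out.append(template, from, template.length()).toString();
    }
}
```

With this sketch, `found 5 errors` round-trips, but a message that already contains the literal text `%W` (for example `literal %W then 5 items`) does not: the scan cannot tell the literal from an inserted placeholder, so the argument lands in the wrong position unless some escaping is added.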

This PR adds a new doc value column which encodes the location of all arguments in the template. For each argument, it stores the offset in the template string and the type of the argument. There is currently only one GENERIC argument type. These values are encoded in a base64 encoded string stored as SortedSetDocValues. Since messages with the same template will have arguments at the same location, and indices are sorted by template_id, this field compresses very well.
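As a rough sketch of the idea (not the actual `Arg` class from this PR), the per-argument info can be written as a count followed by a type code and an offset per argument, then base64 encoded. The names `ArgInfoSketch` and `ArgInfo`, the type code `0` for GENERIC, the hand-rolled variable-length int format, and the use of absolute offsets are assumptions for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical illustration of offset-based arg info; not the Elasticsearch implementation.
final class ArgInfoSketch {
    static final int GENERIC = 0; // assumed type code for this sketch

    record ArgInfo(int type, int offsetInTemplate) {}

    // Variable-length int: 7 bits per byte, low-order bits first, high bit set on all but the last byte.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(byte[] bytes, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = bytes[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Layout: arg count, then (type code, offset) for each argument.
    static String encode(List<ArgInfo> infos) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, infos.size());
        for (ArgInfo info : infos) {
            writeVInt(out, info.type());
            writeVInt(out, info.offsetInTemplate());
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    static List<ArgInfo> decode(String encoded) {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        int[] pos = { 0 };
        int count = readVInt(bytes, pos);
        List<ArgInfo> infos = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int type = readVInt(bytes, pos);
            int offset = readVInt(bytes, pos);
            infos.add(new ArgInfo(type, offset));
        }
        return infos;
    }

    // Rebuild the message by inserting each argument at its stored offset; no scanning is needed.
    static String reconstruct(String template, List<String> args, List<ArgInfo> infos) {
        StringBuilder out = new StringBuilder();
        int from = 0;
        for (int i = 0; i < infos.size(); i++) {
            int offset = infos.get(i).offsetInTemplate();
            out.append(template, from, offset).append(args.get(i));
            from = offset;
        }
        return out.append(template, from, template.length()).toString();
    }
}
```

In this sketch the template for `found 5 errors` is `found  errors` (no placeholder) with one GENERIC argument at offset 6. A literal `%W` in the message is then just ordinary template text, and every document sharing a template produces the same encoded string, which is consistent with the compression argument above.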

kkrik-es · 2025-08-13T04:56:42Z

.../src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextValueProcessor.java

+        dataInput.writeVInt(arguments.size());
+        for (var arg : arguments) {
+            dataInput.writeVInt(arg.type.toCode());
+            dataInput.writeVInt(arg.offsetInTemplate);


Do we need the offset? If we keep the generic placeholder in the template, we can get the offsets from the latter?

I did test adding the placeholder to the template, then scanning the template before message reconstruction to find offsets. This was quite fast in a micro-benchmark, only about 2x slower than the stored offsets in this PR. But this method does not deal with the other issue which this PR addresses: handling messages which contain the placeholder value. There are certainly ways we could deal with this, but those seem complicated.

x-pack/plugin/logsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/Arg.java

martijnvg

I think this is a good change that sets us up for the future.

x-pack/plugin/logsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/Arg.java

martijnvg · 2025-08-18T03:37:42Z

...sdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextFieldMapper.java


+        // Add args schema
+        String argsSchemaEncoded = Arg.encodeSchema(parts.schemas());
+        context.doc().add(new SortedSetDocValuesField(fieldType().argsSchemaFieldName(), new BytesRef(argsSchemaEncoded)));


I think the args schema field can be stored using SortedDocValuesField, given that encodeSchema(...) stores the schemas as a single value, so there is only one value per document?

Ideally, they should be able to be stored as regular SortedDocValues. This is true for all the doc values columns in the patterned_text type. But I ran into an issue where a mapper test class defined in Lucene did not handle SortedDocValues correctly. I submitted this fix: apache/lucene#14839, and it has been merged. If we're using Lucene 10.3, I could go ahead and update all doc values in this type to SortedDocValues. But I'm inclined to do it in a separate PR.

Cool, and thanks for fixing this in Lucene 🚀

kkrik-es · 2025-08-18T10:35:35Z

.../src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextValueProcessor.java

+        int prevArgOffset = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
+                // add the previous delimiter


I was wondering if we can use (double space) to track the presence of an arg, and replace all other whitespace with (single space) in the template. That will help avoid tracking offsets in arg schema and further compress the template, reducing the storage footprint. The reconstructed msg will look slightly different, but afaict whitespaces are barely used in term and phrase queries.

That may impact reconstruction performance, though. The logic is very streamlined, currently.

I am inclined to reconstruct the original message exactly, at least for the time being. This simplifies testing, as no changes need to be taken into account when testing for equality between synthetic and stored source.
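For illustration, a rough sketch of the double-space idea and why it gives up exact reconstruction. This is not code from the PR: `DoubleSpaceSketch`, its `Parts` record, and the "token containing a digit is an argument" rule are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the double-space proposal; not the Elasticsearch implementation.
final class DoubleSpaceSketch {
    record Parts(String template, List<String> args) {}

    // Collapse every whitespace run to a single space and leave an empty slot where an
    // argument was, so an argument shows up in the template as two adjacent delimiters.
    // Assumption for this sketch: a token is an argument if it contains a digit.
    static Parts split(String message) {
        List<String> args = new ArrayList<>();
        StringBuilder template = new StringBuilder();
        String[] tokens = message.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                template.append(' ');
            }
            if (tokens[i].chars().anyMatch(Character::isDigit)) {
                args.add(tokens[i]);
            } else {
                template.append(tokens[i]);
            }
        }
        return new Parts(template.toString(), args);
    }

    // An empty token between delimiters marks an argument slot.
    static String reconstruct(Parts parts) {
        StringBuilder out = new StringBuilder();
        String[] tokens = parts.template().split(" ", -1);
        int argIndex = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            boolean isSlot = tokens[i].isEmpty() && argIndex < parts.args().size();
            out.append(isSlot ? parts.args().get(argIndex++) : tokens[i]);
        }
        return out.toString();
    }
}
```

Here `found 5 errors` round-trips, but a message with a tab or a run of spaces comes back with single spaces, so an equality check between synthetic and stored source would need to account for the normalization.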

x-pack/plugin/logsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/Arg.java

...test/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextValueProcessorTests.java

elasticsearchmachine · 2025-08-28T03:14:11Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

martijnvg

LGTM 👍

Encode args in a binary schema instead of using placeholders

b9bff73

elasticsearchmachine added the v9.2.0 label Aug 13, 2025

[CI] Auto commit changes from spotless

c6f9551

kkrik-es reviewed Aug 13, 2025

parkertimmins and others added 4 commits August 14, 2025 13:45

Remove placeholder from template

957b2ce

Move args into separate class

0727314

[CI] Auto commit changes from spotless

3969361

Add encoder/decoder tests

90cf48c

parkertimmins commented Aug 15, 2025

x-pack/plugin/logsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/Arg.java (outdated)

martijnvg approved these changes Aug 18, 2025

kkrik-es reviewed Aug 18, 2025

x-pack/plugin/logsdb/src/main/java/org/elasticsearch/xpack/logsdb/patternedtext/Arg.java (outdated)

kkrik-es reviewed Aug 18, 2025

...test/java/org/elasticsearch/xpack/logsdb/patternedtext/PatternedTextValueProcessorTests.java (outdated)

parkertimmins mentioned this pull request Aug 19, 2025

Cleanup and improvements for simple version of patterned_text #133146

Closed

5 tasks

parkertimmins added 2 commits August 27, 2025 15:20

Merge branch 'main' into parker/patterned-text-args-schema

874fbf7

Rename Arg.Schema to ArgInfo

e422af9

parkertimmins changed the title ~~Encode args in a schema doc value rather than using placeholders~~ Encode args in a info doc value rather than using placeholders Aug 27, 2025

parkertimmins and others added 4 commits August 27, 2025 15:57

Java doc and a bit of renaming

ff5fae1

Handle offset from previous within encode/decode

f235970

[CI] Auto commit changes from spotless

4c6aa6b

Update template_id yaml tests due to template change

ed5bb78

parkertimmins marked this pull request as ready for review August 28, 2025 03:11

elasticsearchmachine added the needs:triage label Aug 28, 2025

parkertimmins added the >non-issue and :StorageEngine/Mapping labels and removed the needs:triage label Aug 28, 2025

elasticsearchmachine added the Team:StorageEngine label Aug 28, 2025

parkertimmins requested review from kkrik-es and martijnvg August 28, 2025 03:19

Cleanup some things missed during rename

ceeddb4

martijnvg approved these changes Aug 28, 2025

parkertimmins added 3 commits August 28, 2025 09:45

Merge branch 'main' into parker/patterned-text-args-schema

b0a3c59

Merge branch 'main' into parker/patterned-text-args-schema

68144ea

Merge branch 'main' into parker/patterned-text-args-schema

909adbd

parkertimmins merged commit 3b25b97 into elastic:main Aug 28, 2025
33 checks passed


4 participants
